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f^ ' Abstract 

^^ ' Model selection is of fundamental importance to high dimensional modeling fea- 

tured in many contemporary applications. Classical principles of model selection include 
the KuUback-Leibler divergence principle and the Bayesian principle, which lead to the 
Akaike information criterion and Bayesian information criterion when models are cor- 
rectly specified. Yet model misspecification is unavoidable when we have no knowledge 
of the true model or when we have the correct family of distributions but miss some 
Li^ ' true predictor. In this paper, we propose a family of semi-Bayesian principles for model 

C/^ ' selection in misspecified models, which combine the strengths of the two well-known 

(~| . principles. We derive asymptotic expansions of the semi-Bayesian principles in misspec- 

C^ , ified generalized linear models, which give the new semi-Bayesian information criteria 

(SIC). A specific form of SIC admits a natural decomposition into the negative maxi- 
mum quasi-log-likelihood, a penalty on model dimensionality, and a penalty on model 
misspecification directly. Numerical studies demonstrate the advantage of the newly pro- 
posed SIC methodology for model selection in both correctly specified and misspecified 
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1 Introduction 

High dimensional modeling is commonly encountered in many contemporary applications. 
The data we consider in this article is of the type {yi,Xii,--- ,Xip)f^i, where the y^'s are 
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n independent observations of the response variable Y conditional on its covariates, or ex- 
planatory variables, (xji, • • • ,Xip). When the dimensionality (or the number of covariates) 
p is large compared to the sample size n, it is desirable to produce sparse models that in- 
volve subsets of predictors. With such models one can improve the prediction accuracy and 
enhance the model interpretability. See Fan and Lv (2010) for an overview of recent devel- 
opments in high dimensional variable selection. A natural and fundamental problem that 
arises from such a task is to compare models with different sets of predictors. 

Two classical principles of model selection are the Kullback-Leibler (KL) divergence prin- 
ciple and the Bayesian principle, which lead to the Akaike information criterion (AIC) by 
Akaike (1973, 1974) and Bayesian information criterion (BIC) by Schwartz (1978), respec- 
tively, when the models are correctly specified. The AIC and BIC have been provn to be 
powerful tools for model selection in various settings, see Burnham and Anderson (1998) for a 
book-length account of these developments. Stone (1977) shows heuristically the asymptotic 
equivalence of AIC and cross-validation under the true model, where a logarithmic assess- 
ment of the performance of a predicting density is used in the cross-validation. Bozdogan 
(1987) studies asymptotic properties of the AIC and BIC, showing that asymptotically AIC 
has a positive probability of overestimating the true dimension, while BIC is asymptotically 
consistent in estimating the true model. Hall (1990) compares AIC with KL cross-validation 
in the setting of histogram density estimation. Yang and Barron (1998) study the asymp- 
totic property of model selection criteria related to the AIC and minimum description length 
principles in density estimation. Other works on model selection include the risk inflation 
criterion (Foster and George, 1994), generalized information criterion (Konishi and Kita- 
gawa, 1996), Bayesian measure using deviance information criterion (Spiegelhalter et al., 
2002; Gelman et al, 2004), model evaluation by using the absolute prediction error (Tian 
et al, 2007), tuning parameter selection in penalization method (Wang, Li and Tsai, 2007), 
parametricness index (Liu and Yang, 2009), and many variations of the AIC and BIC (see, 
e.g., Bozdogan, 1987 and 2000). 

A common strategy of parametric modeling is to choose a priori a family of distributions 
and then find a model in this family that best fits the data, often based on the maximum 
likelihood principle. It has been a common wisdom in statistics, however, that "all models 
are wrong, but some are more useful than others." Thus, in the broad sense all models 
are misspecified. A direct generalization of the parametric modeling setting, which was 
targeted by AIC, BIC, and other model selection criteria, is to choose the best model among 
a set of families of models with different dimensional parameter spaces. In the linear model 
setting, for example, one is given a set of potential predictors but does not know which 
subset of the predictors are truly useful. Many approaches (see, e.g., Tibshirani, 1996; Fan 
and Li, 2001; Zou, 2006; Candes and Tao, 2007; Fan and Lv, 2008; Hah and Miller, 2009; 
Hall, Titterington and Xue, 2009; Lv and Fan, 2009) have been proposed in the literature 



for variable selection. A common idea is to first construct a sequence of candidate linear 
models with different subsets of predictors and then choose the best one according to some 
model evaluation criterion such as AIC or BIC. However, even in this setting, the families 
of models may still not contain the true model, and neither AIC nor BIC explicitly account 
for such model misspecification. Explicitly estimating the discrepancy between the "best" 
fitted model in the "best" family and the true model can be potentially helpful. 

In this paper, we derive asymptotic expansions of several model selection principles in 
misspecified generalized linear models (GLMs) (McCullagh and Nelder, 1989). The general 
idea applies to other model settings as well. The technical results lead us to propose the new 
semi-Bayesian information criteria (SIC). In particular, a specific form of SIC can be written 
as the sum of the negative maximum quasi-log-likelihood, a penalty on model dimensionality, 
and a penalty on model misspecification directly. Most conventional information criteria have 
been derived for independent and identically distributed (i.i.d.) observations. Our results 
can also be viewed as an extension of the classical approaches to non-i.i.d. settings. 

The rest of the paper is organized as follows. In Section [2] we study the consistency and 
asymptotic normality of the quasi-maximum likelihood estimator in misspecified GLMs. We 
investigate the asymptotic expansions of the KL divergence principle and the Bayesian prin- 
ciple in Sections [3] and m respectively. In Section [5l we present the semi-Bayesian principles 
and the SIC methodology. We present several numerical examples to illustrate the finite 
sample performance of the proposed SIC methodology for model selection in misspecified 
models in Section [6i Section \7\ provides some discussions of our results and their implica- 
tions. All technical details are relegated to the Appendix. Some necessary formulas of SIC 
for three commonly used GLMs are also given in the Appendix. 

2 Technical preparation 

For completeness, we present here some asymptotic properties of the quasi-maximum likeli- 
hood estimator in misspecified GLMs with deterministic design matrices, which serve as the 
foundation for deriving asymptotic expansions of model selection principles in Sections [SHSl 
See White (1982) for a systematic treatment of the problem in the i.i.d. setting. Assume 
that the n-dimensional random vector of responses Y = {Yi, ■ ■ ■ ,y„)^ has a true unknown 
distribution Gn with density function 

n 

9n{y) =Y{gn,i{yi)^ (1) 

i=l 

where y = (yi, • • • , y-n) ■ Each observation yi of the response variable may depend on all or 
some of the p predictors Xij. Model ([T]) entails that all components of Y are independent 
but not necessarily identically distributed. 



2.1 The quasi-maximum likelihood estimator 

Consider an arbitrary subset lUT C {!,••• ,p} with d = [5Jt[ < n Ap. Define Xj = {xij : j G 
SPt)^ and X = (xi, • • • , x„)^, an n x d deterministic design matrix. Since the true model Gn 
is unknown, we choose a family of generalized linear models Fn{-, (3) as our working models, 
with density function 

n n 

fn{z, (3)dno{z) = Y[fo{zi,6i)dno{zi) = JJexp [9iZi - b{6i)] dfj,{zi), (2) 

i=l i=l 

where z = (zi, • • • , Zn) , = (^i, • • • , 0„) = X/3 with /3 € R , b{9) is a convex function, 
^0 is the Lebesgue measure, and /i is some fixed measure on R. Clearly {/o(-z, 9) : 9 G R} is 
a family of distributions in the regular exponential family and may not contain gn/s. The 
following condition is common in the GLM setting. 

Condition 1. b{9) is twice differentiable with b"{9) always positive and X has full column 
rank d. 

For notational convenience, we define two vector-valued functions h{6) = {b{9i), • • • , b{9n)) 
and ^l{e) = {b'{9i), ■■■ , b'{9n))^, and a matrix- valued function S(0) = diag{6"(6li), • • • , b"{9n)}, 
6 = (01, • • • , 9n)'^ ■ For any n-dimensional random vector Z with distribution Fn{-,I3) given 
by ©, it holds that 

EZ = /x(X/3) and cov(Z) = T.(Kp). (3) 

The density function ([2]) can be rewritten as 

Uz,P) = exp [z^X/3 - l^b(X/3)] J] T^iz^), 

where ^ denotes the Radon-Nikodym derivative. Given the observations y and X, this 
gives the quasi-log-likelihood function 

4(y, (3) = log /„(y, (3) = y^X/3 - l^b(X/3) + V log -^(y,). (4) 

The quasi-maximum likelihood estimator (QMLE) of the d-dimensional parameter vector (3 
is defined as 

3n = arg max 4(y, f3). (5) 

The following proposition follows easily from a standard strict concavity argument. 
Proposition 1. Under Condition 1, the QMLE P^ is unique. 



Clearly the QMLE /3„ is the solution to the score equation 

*n(/3) - ^^^ = X^[y - M(X/3)] = 0, (6) 

which becomes the normal equation X y = X X/3 in the linear regression model. The con- 
sistency and asymptotic normality of the maximum likelihood estimator (MLE) in correctly 
specified GLMs have been studied by Fahrmeir and Kaufmann (1985). A general theory of 
maximum likelihood estimation of misspecified models was presented by White (1982), who 
studied the case of i.i.d. observations from a general distribution and used the KL divergence 
as a measure of model misspecification. We generalize those results to misspecified GLMs 
with deterministic design matrices, where the observations may no longer be i.i.d. 

2.2 Kullback-Leibler divergence and parameter identifiability 

The KL divergence of density / from density g, which was introduced by KuUback and 
Leibler (1951), is defined as 

I{9'J) = j [\og g{u)]g{u)du - j [log f {u)]g{u)du, 

which gives an upper bound on the squared Hellinger distance between distributions. The 
KL divergence of the posited GLM F„(-,/3) in ^ from the true model Gn is 



H9n;fn{-,(3))= [\og gn{z)]gn{z)dz - [logfn{z,f3)]gniz)dz 

= ^\ [^osan,iiz)]9nAz)d'Z - / [log fo{z,9i)]gn^i{z)dz i = '^Iign,ufQ{-,6i))- (7) 

Throughout the paper, the expectation and covariance are taken with respect to the true 
distribution G„ unless specified otherwise. We need the following basic regularity condition. 

Condition 2. For each 1 < i < n, hi{9) = Elog /(){¥{, 9) is smooth in 6, and the differ- 
entiation and expectation are exchangeable so that h'-{9) = E d log f^iYi, 9) / d9 and h'l{9) = 
Ed'^log fo{Yi,9)/d9'^, which are also smooth in 9. 

The following theorem shows that there is a unique distribution i^„(-, /3„ q) in the family 
of misspecified GLMs in ([2|) that has the smallest KL divergence from the true model Gn- 

Theorem 1. (Parameter identifiability). Assume that Conditions 1 and 2 hold. Then the 
KL divergence I {gn, fn{- , (3)) has a unique global minimum ai /3„ q € R'^, which solves the 
equation 

X'^[^Y-/x(X/3)] = 0. (8) 



This shows that the parameter identifiabihty problem has content in misspecified models. 
The uniqueness entails that -F„(-,/3„o) = Gn when the model is correctly specified, i.e., 
G„ e {Fn{-,(3) : (3 e R"^}. Since X^/x(X/3„^o) = X^£;Y, we have 

COV [^niPnfi)] = cov (X^Y) = X^cov(Y)X = B„, (9) 

where cov(Y) = diag{var(yi), • • • , var(y„)} by the independence assumption. We introduce 
another matrix 

9"'^SnJn{;P)) ^ _d'in{y,P) ^ x^5:(x/3)x ^ AM (10) 

and define A„ = A„(/3„ q). In view of (I2D, A„(/3) is exactly the covariance matrix of X Y 
when Y has distribution Fn{-,P), and thus A„ is the covariance matrix of X Y under the 
best misspecified GLM Fn{-, (3^q), whereas B„ is the covariance matrix of X Y under the 
true model G„. It is known in classical maximum likelihood theory that A„ = B„ when 
the model is correctly specified. These two matrices play a pivotal role in quasi-maximum 
likelihood estimation of misspecified models. 

2.3 Consistency and asymptotic normality of QMLE 

Early work on the consistency of estimators for the parameters of interest when the prob- 
ability model is misspecified includes Berk (1966, 1970) and Huber (1967). Berk (1966, 
1970) takes a Bayesian approach, and Huber (1967) provides conditions built on those of 
Wald (1949) for the consistency of maximum likelihood estimates. White (1982) uses the 
KL divergence as a measure of model misspecification and extends conditions on the asymp- 
totic properties of maximum likelihood estimates given by LeCam (1953) to QMLE under 
misspecified models. 

We need the following regularity conditions to establish the consistency of the QMLE /3^ 
in misspecified GLMs introduced in Section [2.11 



Condition 3. Ainin(B„) — )• oo as n ^f oo, where Amin(') denotes the smallest eigenvalue. 

Condition 4. There exists some c > such that for any 5 > 0, min^ ^^ ,^. Ainin[V„(/3)] > c 
for all sufficiently large n, where V„(/3) = B„ A„(/3)B„ and Nn{6) = {/3 G R*^ : 
\\{n^^'Rn)^''^{P — /3„o)|| < n^^''^6} with \\ ■ \\ the Euclidean norm. 

Condition 3 requires that the smallest eigenvalue of X cov(Y)X diverges as n increases. 
If var(li) are bounded away from and cxd, then it is equivalent to the usual assumption on 
the design matrix X in linear regression for consistency, i.e., Aniin(X X) — )• cx) as n ^ oo. 
Intuitively, Condition 4 means that the covariance structures given by the misspecified GLMs 
in a neighborhood of the best working model -F,i(-,/3„o) cannot be too far away from the 
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covariance structure under the true model G„. This requirement is analogous to that in 
importance sampling, i.e., the support of the sampling distribution has to cover that of 
the target distribution. When the model is correctly specified, we have V„(/3„o) = ^d 
since A„ = B„ by the equivalence of the Hessian and outer product forms for the Fisher 
information matrix. 

Theorem 2. (Consistency). Under Conditions 1-4, the QMLE [3^ satisfies f3.^^ — (^no — 

Op(\). 

To obtain the asymptotic normality of the QMLE /3„, we need two additional regularity 
conditions, where the first one is on the continuity of a matrix- valued function and the 
second one is a moment condition. For any /3i, • • • ,/3(^, denote by A„(/3i, • • • ^^^) a d x d 
matrix with j-th row the corresponding row of A„(/3j) for each j = !,••• ,d, and define 
matrix-valued function V„(/3]^, • • • , (3^) = ^n ^n{Pi, ■ ■ ■ , (^dJ'^n 

Condition 5. For any 6 > 0, laaxg g ^^ z^-, ||V„(/3x,--- ,l3fi) — V„|| = o(l), where 
V„ = V„(/3„o) = B„ A„B„ and \\ ■ \\ denotes the matrix operator norm. 

Condition 6. maxf^^ E\Yi - EY,i\^ = 0(1) and YJLii^J^n^^if^ = o{l). 

Theorem 3. (Asymptotic normality). Under Conditions 1-6, the QMLE 13^^ satisfies 



Cn(3„-/9„n) AiV(0,/, 



where C„ = B„ i 



—111 \li 

When the model is correctly specified, we have C„ = B„ A„ = A„ since A„ = B„. 

Thus in this case, the consistency and asymptotic normality of the QMLE /3„ in Theorems 

2 and 3 become the conventional asymptotic theory of the MLE. The asymptotic normality 

of the QMLE /3„ in misspecified GLMs provides the theoretical foundation for proposing 

the new model selection methodology. To simplify the technical presentation, the above 

asymptotic normality is for fixed dimensionality d. With more delicate analysis, one can 

show the asymptotic normality for diverging dimensionality d, which is not the focus of the 

current paper. See, e.g., the technical analysis in Fan and Lv (2009) for penalized maximum 

likelihood estimator, where the dimensionality can grow non-polynomially with sample size. 

Since the QMLE /3„ provides a consistent estimator of /3„ g in th^ best misspecified GLM 

F„(-,/3„o)) natural estimates of matrices A„ and B„ are, respectively. 



and 



A„ = A„(/3„) = X^S(X/3„)X (11) 

%n = X^diag { [y - /x(x3j] o [y - ^(x3j] } X, (12) 

7 



where o denotes the Hadainard (componentwise) product. B„ is an unbiased estimator of B„ 
when E'Y = /2(X/3„o)- Ii^ practice, one can construct other estimates of B„ by, e.g., using 
bootstrapping or treating the squared residual [y — /i(X/3„)] o [y — /x(X/3„)] as a function of 
the corresponding fitted value /x(X/3„), or some covariate, and doing some local smoothing. 
For simplicity, we use the estimate B„ defined above in all the numerical studies. 

3 The Kullback-Leibler divergence principle and GAIC 

After choosing a sequence of subsets {9Jtm : w. = 1, • • • , M} of the full model {!,••• , j?}, we 
can construct a sequence of QMLE's {Pnm • "^ = Ir'' i M} by fitting the GLM ([2]) as 
described in Section [2j A natural question is how to compare those fitted models. In this 
section we consider the KL divergence principle of model selection introduced by Akaike 
(1973, 1974), who studied the case of i.i.d. observations with correctly specified model. 

3.1 Kullback-Leibler divergence principle of model selection 

The QMLEs {/3„ „ : m = 1, • • • , M} become the MLEs when the model is correctly specified. 
Akaike's principle of model selection is choosing the model 9Jtmo that minimizes the KL 
divergence I{gn', fni-,(inm)) of the fitted model Fn{-,(3,^^) from the true model Gn, i.e., 

mo = arg min I{gnJn{-,Pn,m))- (13) 

■me{l,---,Af} 

In view of ([7]), we have 

I{gn;fn{-3n,rn)) = ^ log 5n(Y) - ?7„(3„,m), (14) 

where Tin{(3) = i?^„(Y, /3) and Y is an independent copy of Y. Thus 

mo = arg max ry„(3„,„) = arg max E~in{Y,P ), 

me{l,---,A/} me{l,---,M} I 

which shows that Akaike's principle of model selection is equivalent to choosing the model 
5Jtmo that maximizes the expected log-likelihood with the expectation taken with respect 
to Y, an independent copy of Y. Fix an arbitrary 9K = 9Jtm and /9„ = Pnm- Using the 
asymptotic theory of MLE, Akaike (1973) shows for the case of i.i.d. observations that 
VniPn) defined in (fn|) can be asymptotically expanded as iniy,l3n) ~ 1^1) which then leads 
to the seminal AIC for comparing competing models: 

AIC(y,9Jt) = -24(y,3j + 2\m\. (15) 

The factor of 2 was historically included due to its connection with log-likelihood ratio tests. 



3.2 GAIC in misspecified models 

In this section, we derive a generalization of AIC in misspecified models. AIC has been 
studied by many researchers, mainly for the case of i.i.d observations. Takeuchi (1976) 
generalizes the derivation of AIC and obtains a term tr(A~ B) in place of the dimension 
of the model, where B and A are two matrices that involve the first and second derivatives 
of the log-likelihood function, respectively. This term was also obtained by Stone (1977) in 
deriving the asymptotic expansion for the cross-validation. In fact, tr(A^ B) is the well- 
known Lagrange-multiplier test statistic. See also Hosking (1980), Shibata (1989), Konishi 
and Kitagawa (1996), Burnham and Anderson (1998), and Bozdogan (2000). 

For simplicity, hereafter we drop the last term in the quasi- log-likelihood £„(y,/3) given 
by @, which does not depend on /3, redefine it as 

4(y,/9)=y'^X/3-l^b(X/3). 

Fix any model Tim in the sequence of competing models. We drop the subscript m to 
simplify the presentation. In view of (|14p . for the purpose of model comparison, we need to 
estimate the quantity r/„(/3„). In practice, we have only a single sample y. The maximum 
quasi-log-likelihood £„(y,/3„) tends to overestimate T]ni(3n) since we would use the same 
data twice, i.e., in both estimation and model evaluation. More specifically, in{y,(3n) tends 
to overestimate £„(y,/3„Q), which is an unbiased estimate of rjnifino) — -E'^n(Y, /3„ q), and 
VniPno) tends to overestimate r]n{Pn) since F„(-,/3„o) is the best misspecified GLM under 
the KL divergence. We need the following regularity condition for deriving the asymptotic 
expansion of Er]n{Pn)- 

Condition 7. tr (A^^B„) = 0(1), sup^maxj^^j \d^r]n{(3nfi+n^^'^C~^S)/d5jd5kd5i\ = o{ti?/'^), 

and for some 5 > 0, -E||C„(/3„ — /3„o)lP^'^ — 0{\), where C„, = B„ A„ and 5 = 

(<^i,---,<5rf)^. 

Theorem 4. Under Conditions 1-1, we have 

^r?„(3n) =i?4(y,3n) -tr(A-iB„) +o(l), (16) 

where both expectations are taken with respect to the true distribution Gn- 

As mentioned before, a correction term of form tr (A~ B„) is well known in the literature, 
but has been usually derived for the case of i.i.d. observations and often by heuristic argu- 
ments. Following the asymptotic expansion in the above theorem, we have the generalized 
AIC (GAIC) as follows. 

Definition 1. The GAIC of the competing model 9JI is 

GAIC(y, m- Fn) = -2^„(y, X) + 2tr (a^'b^) , (17) 



where A„ and B„ are estimates of A„ and B„ given in Section \2.3\ 



GAIC shares the common feature of generahzations of AIC. It incorporates the effects 
of model complexity and model misspecification in a single term. Since tr(A„ B„) = 
tr(B„ A^ B„ ) > 0, the term added to the negative maximum quasi-log- likelihood in 
GAIC is indeed a penalty term. When the model is correctly specified, tr(A~^B„) = tr(/rf) = 
(i = [5Jt| since A„ = B„, which is the number of predictors that drive the GLM ([2]). Thus we 
would expect tr(A„ B„) ^ tr {I^) = \Tl\, which reflects the penalty on model complexity as 
that in AIC. As mentioned before, many extensions of AIC under different settings share the 
same form as in p!7|) . We point out that equation (4.5) in Stone's derivation (Stone, 1977) 
suggests that under the same working models, GAIC and cross-validation are asymptoti- 
cally equivalent under some weak conditions, where the cross-validation uses the log-density 
assessment. 



4 The Bayesian principle and GBIC 

4.1 BIC and KL divergence 

Suppose we have a set of competing models: {Tim : m = 1, ■ ■ ■ , M}. A popular Bayesian 
model selection procedure is to first put nonzero prior probability agjtm o^ each model Tim, 
and then give a prior distribution /igjj^ for the parameter vector in the correspnding model. 
Assume that the density function of fJ-m^ i^ bounded in R^™ = R"^™ and locally bounded 
away from zero throughout the domain. The Bayesian principle of model selection is to 
choose the most probable model a posteriori. That is, to choose model Tlmo such that 

mo = arg max S{y,Tlm;Fn), (18) 

me{l,--- ,M} 

where 

5(y, Tim-, Fn) = log I aan™ exp [£„(y, f3)] dfi^JP) (19) 

with the log-likelihood iniy^f^) as in (jj]) and the integral over R '". Fix an arbitrary Tl = 
Tim and /3„ = /3„m- Using the asymptotic theory of MLE, Schwartz (1978) shows for the 
case of i.i.d. observations with correctly specified model in a regular exponential family that 
S{y,Tl;Fn) ~ ^n(y,/3n) — ^^|9JI| + logagjl) where /3„ is the MLE of (3. This expression 
allows Schwartz to introduce the seminal BIC for model selection: 

BIC(y,9K) = -2£n{y3n) + (logn) \Tl\, (20) 

where a factor of 2 is included to make it consistent with AIC in (|15|) . Schwartz's orig- 
inal arguments were generalized later by Cavanaugh and Neath (1999) via weakening the 
assumptions. 



10 



For each m G {1, • ' ' j-^}) the marginal probabihty Vm of the response vector Y condi- 
tional on model Tim has the density function 

^(z) = y"exp[4(z,/3)]dM™^(/3), zGR", (21) 

where /io is the Lebesgue measure. Similar to d?]), the KL divergence of Vm from the true 
model Gn is 

-f(ffn;-j^j = [log gn{z)]gniz)dz - / log— ^(z) gn{z)dz 

= E log gn(Y) - E log J eicp[en{Y, (3)] dm^{(3), (22) 

where leads to the following proposition. 

Proposition 2. a) For each m G {1, • • • , M}, we have 

ES{Y,mm;Fn) = -I L,;^] + log am^+E log gniY), (23) 

where the expectation is taken with respect to the true distribution Gn- 
b) Assume agjti = • • • = amM- Then we have 

arg max ES{Y,Tlm.; En) = aig min I i gn, 



m6{l,---,A^} mG{l,---,A:f} V "'1^0 

Proposition 2 holds regardless whether or not the true model is in the set of candidate 
models. We see also that for each m G {!,■■■ ,M}, —S{y,Tlm', En) + loga5rji™ gives, up 
to a common additive constant, an unbiased estimate of / ( 5„; -J^ I . We also see that the 
Bayesian principle of model selection can be restated as choosing the model that minimizes 
the KL divergence of the marginal distribution of the response vector Y from its true dis- 
tribtution, provided that we assign equal prior probabilities on the M competing models. 

4.2 GBIC in misspecified models 

In this section, we derive the asymptotic expansion of the log-marginal likelihood <S'(y, 9Jt; En) 
in misspecified models. For notational convenience, we drop the subscript m. In the GLM 
([2]), the parameter vector G R" for the response vector Y is given hy 6 = X/3. Under 
this model and some regularity conditions, as in Schwartz (1978) we can apply the classical 
Laplace approximation (i.e., Taylor-expanding the exponent to the second order and using 
the Gaussian integration to obtain necessary terms) to get that 

S(y, Tl; En) = 4(y, 3„) + log J e-(/3-3 J" A„(3 J(/3-3 J/2^^ ^ ^, 

= 4(y,3„) -^^m\-\ log \n-'An{X)\ + Cn, (24) 
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where /3„ is the maximizer of 4(y,/3), A„(/3) = -d'^ln{y,f3)/dp'^ = X'^S(X/3)X, and C'^ 
and Cn are both bounded in n. However, when the model is niisspecified, it is no longer 
clear whether the Cn term is still bounded. 

Note that the term — ^ log |n~^A„(/3^)| in the asymptotic expansion (pH) depends on 
the scale of the design matrix X. Thus this second order term is not very meaningful. In 
our derivation below, we correct this by rescaling the sequence of design matrices {X = 
(xi,--- ,x„)-^ : n = 1,2,---} before we do the Bayesian inference. Let P„ be a d x d 
nonsingular matrix. We apply a linear transformation to the design matrix, X — >■ X = XP„. 
Although there are infinitely many choices of P^, we consider one such that n~^'^X Y has 
the identity covariance matrix Id under the true model Gn- Thus 

h = cov (n-i/2x^Y) = P^cov (n-i/Sx^Y) P, = P^ (n-^B^) P„, (25) 

where B„ = X cov(Y)X is the covariance matrix of X Y under the true model G„. It is 
known that any solution P„ to equation (j25|) admits a decomposition P„ = (n~^B„)~^/^Q, 
where Q is some d x d orthogonal matrix. Thus, our choice of P„ is unique up to an 
orthogonal matrix. As seen later, this is sufficient for deriving the asymptotic expansion of 
S{y, mt; Fn) since only the determinant of P„ comes into play. For simplicity, we take P„ = 
{n^^'Bj-i)~^ , whose inverse appeared in Nn{S) in Condition 4 for establishing the consistency 
and asymptotic normality of the QMLE /3„. In the specific case of linear regression with 
error variance r^, P„ becomes T~^{n~^^ X)^-'^'^. 

Let {X = XP„ = X(n~^B,„)^^'^ : n = 1, 2, • • • } be the sequence of aligned design matri- 
ces. Then we have the parameter space {/3 : /3 G R'^} and thus /3 = P„,/3 = {n^^'Bn)~^''^(3- 
Hereafter the prior distribution fisjji is understood on the parameter (3. Let /xq be the 
Lebesgue measure on R"^, and vr(/3) = -^^{P) the prior density on /3 for model 9JT. De- 
fine 6 = (n-^B„)i/2(^ _ ^^)_ Then with a change of the variable ^ = (n-iB„)i/2^ = 
S + (n~^B„)^/^/3„, the prior density of fi^ can be written in terms of S as 

dfiwiP) = 7r(3)d3 = Tr(6 + n^^/^Bl/%) dS. (26) 

For any S G R , we define 

£:(y, d) = in{y, 3„ + n'^'B-'/^S) - £n{y, 3n), (27) 



which is the deviation of the quasi-log- likelihood from its maximum. It follows from ()19p . 
(I2SD and dSZl) that 

5(y, 9Jt; F^) = log a^ + 4(y, X) + log ^/.,,, [f/n(5)"], (28) 

where Un{S) = exp [n~^^* (y, 5)] . We need the following two regularity conditions to derive 
the asymptotic expansion of the above term log Ef^^^[Un{S)^]. 
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Condition 8. There exist some ro,ci,C2 > such that the prior density vr(/3) satisfies 
inf 7r(s + n-^^'^Bl/'^P^) > ci and sup it (s + n~^^'^Bl/'^P^) < C2. 

\\S\\<ro ^ ' ,5gR<* ^ ' 

Moreover, |[n~^'^B„ /3„|| is hounded away from and oo. 

Condition 9. Amax(D„) = 0(1), and there exists a sequence of rn > with rn = o(l) such 
that pn{rn) < C3Amin(Dn) and Amin(Dn)"^r^^ = 0(n") for some C3,a e [0, 1), where D„ = 

B^^/^A„(3n)Bn^''^ Pn{r) = maX||^||^^ max{| Amm[F„(5)] |, |Ainax[Fn(5)]|} with¥n{8) = Bn'^^'^ 

[A„(/3^ + Bn S) — An{Pn)]Bn , and Xma^i) denotes the largest eigenvalue. 



In view of (p6|) . Condition 8 requires that the prior density for /3 is bounded and locally 
bounded away from zero. Since the relative scale of /3 compared with the original parameter /3 
is given by 3 = 71-^26^^/3, the condition that \\n-'^l'^B]l^]5.J, = [n-i(x3„)'^cov(Y)(x3„)]^/^ 
is bounded away from and oo justifies the use of prior on parameter (3. The technical proof 
applies as well to the case when Amax(Dn) diverges as n — )• oo. Conditions 8 and 9 are 
sensible by noting that the QMLE /3„ is shown to have asymptotic normality under some 
regularity conditions in Theorem 3. 

Theorem 5. Under Conditions 1, 8 and 9, we have 

■^ loE 77. 1 ^—1 

5(y,5Jt;F„) = f„(y,/3„)--|-|9?T| + -log|A„ B„| + loga^ + i?(y,9Jt; F„), (29) 

where A„ = A„,(/3„) and the remainder i?(y, 9K;-F„) is hounded in n. 

i^~l l/2^~l 1/2| 

In Condition 9, log |A„ B„| = log [B,„ A^ B„ | is allowed to diverge as n ^ oo since 
it is only assumed that Amax(Bn A„ BJ ) = Amin(Dn)^"'^ = 0{n'^r'^), where a G [0,1). 
As mentioned before, Amin(Bn A„ B„ ) = Aniax(Dn)~"^ can also be allowed to converge to 
zero as 71 ^ oo. Following the asymptotic expansion in the above theorem, we introduce the 
generalized BIC (GBIC) as follows. 

Definition 2. We define GBIC of the competing model 9JI by 

GBlC{y, m;Fn) = -24(y,3n) + (log7i) \m\ - log|A;'B„|, (30) 



where A„ and B„ are estimates of A„ and B„ given in Section \2.3X 

Compared with BIC, GBIC takes into account model misspecification explicitly. The 
second term in GBIC is the same as that in BIC and penalizes the model complexity. How- 
ever, generally the third term — log|A^ B„| in GBIC is not always nonnegative. When 
the model is correctly specified, we would expect A„ k. B„ since A^, = B„ and thus 
log |A„ B„[ Ri log |/rf| = 0. In this sense, GBIC introduced above generalizes BIC. 
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5 The semi-Bayesian principles and SIC methodology 

In this section we introduce a family of semi-Bayesian principles for model selection in mis- 
specified models that bridge the two well-known principles: the KL divergence principle and 
the Bayesian principle for model selection, which have been considered in Sections [3] and HI 
The semi-Bayesian principles give the new SIC methodology. Assume that we have a sequence 
of competing models indexed by subsets {Tim : m = 1, ■ ■ ■ , M} of the full model {!,••• ,p} 
of all the p covariates. We can construct a sequence of QMLE's {/3„ ^ : m = 1, • • • , M} as 
in Section [3l and fit the GLM ([2]) using a Bayesian approach as in Section [H 

5.1 Semi-Bayesian principles of model selection 

We define the semi-Bayesian principles of model selection as follows. 

Definition 3. The semi-Bayesian principle of model selection with index 7 € [0, 1] is choos- 
ing the model 97Tmo ^^^^ minimizes 

D-yidn^m; ^m) = (1 " l)I{9n] fn{-3n,m)) " 7*^(5n; /«(-, /3„,m,o)) + 7^ ( ^ni J^ j , (31) 

where j^ =jA{l-j), /(5„; /„(-,/3„ „)) is defined in (1^, /(^n; /«.(-, /3„,m,o)) is the minimum 
KL divergence of the misspecified GLM Fn{-,(3) from the true model Gn over P € R' "I, and 
L{gn', -^) is defined in h22\) with v^ given by h21\) the marginal distribution of the response 
vector Y conditional on model 9Jtm ■ That is, 

mo = arg min i:>^(/3„„;9Jtm). (32) 

mg{l,---,A/} 

We see that the semi-Bayesian principle of model selection with index 7 = becomes the 
KL divergence principle of model selection, and that with index 7 = 1 is equivalent to the 
Bayesian principle of model selection by its KL divergence interpretation given in Proposition 
2. It is easy to see that D^(/3„ ^;9Jtm) > for any 7 € [0,1] since /(^n; /n(')/3n m)) ^ 
Hdn', fn{-,Pnmo)) always holds. In particular, for the semi-Bayesian principle of model 
selection with index 7 = 1/2, 2D^(/3„ ^; Tim) is the sum of the excess KL divergence of the 
model relative to the minimum one and the KL divergence of the marginal distribution of 
the response from the true one. 

The intuition is that when the model is misspecified, even the best marginal distribution 
of Y can be far away from the true distribution Gn- Thus finding a model that compromises 
the terms in ()3ip can be desirable, which combines the strengths of the two well-known 
principles. For simplicity we drop the subscript m. The following theorem, which is proved 
using the theory established in Sections [2H11 gives the asymptotic expansion of £'D^(/3„; TV). 
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Theorem 6. Under Conditions 1-9, we have 

ED^iX;^ = -7*EiniyA) + ^^^m\ - ^Elog [A^'b^I + ^tr (A-^B„) 

+ j*Eloggniy)--fERiy,Tl;Fn) + o{l), (33) 

where all expectations are taken with respect to the true distribution Gn, 7* = 7 V (1 — 7), 
7** = (2 - 37) V (1 - 7), A„ = A„(3„), and i?(y,9?T;F„) is given in (2^. 

5.2 The SIC methodology 

The asymptotic expansions of the senii-Bayesian principles in misspecified GLMs give a 
family of semi-Bayesian information criteria (SIC). Following Theorem 6, we define the SIC 
methodology as follows. 

Definition 4. We define SIC with index 7 G [0, 1] of the competing model 9Jt by 

SIC^(y, m;Fn) = -2^*l.{y, 3„) + 7 (log n) \m\ - 7 log IH„I + 7**tr (h„) , (34) 

where H„ = A„ B„ with A„ and B„ estimates of A„ and B„ given in Section {K 



We see that SIC with index 7 = becomes GAIC defined in (flTj) . and SIC with index 
7 = 1 becomes GBIC defined in ()30p . In particular, the semi-Bayesian principle of model 
selection with index 7 = 1/2 gives SIC with index 7 = 1/2. We show next that this specific 
form of SIC in misspecified models admits a natural decomposition of the form 

goodness of fit + model complexity + model misspecification, 

where the first term is the negative maximum quasi- log- likelihood, and penalties on model 
complexity and model misspecification are both nonnegative. As mentioned before, neither 
GAIC nor GBIC has such a decomposition. 

Proposition 3. For any d x d symmetric positive definite matrices A„ and B„ given in 
[34\ ), we have for 7 = 1/2, 

^^ 1 -l- loff n r ^^ ^^ ' 

SlC^iy, m-Fn) = -in{y,Pn) + \ M + I [iV(0, B„); iV(0, A„)J , (35) 

where /[A^(0,B„); A^(0, A^)] = |[tr(H„) — log|Hn| — d] is the KL divergence of the d- 



2 
variate Gaussian distribution A^(0,A„) from the d-variate Gaussian distribution A^(0,B^ 



with H„, = A„ B„,. 



The penalization on model misspecification, i.e., the third term, in SIC with index 7 = 1/2 
has a natural interpretation. Under the true model G„, we have 

E (X^Y) = X^SY and cov (X^Y) = X^cov(Y)X = B„. 
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Under some regularity conditions, we can show that due to the central limit theorem the 
asymptotic distribution of X (Y — fi) is close to A''(0,B„), where fi = EY. On the other 
hand, if we misspecify the true model as the GLM Fn{-,(3), the QMLE /3„ converges to (3„ g 
so that Fn{-,Pno) is the best model in the posited GLM family for approximating the true 
model Gn under the KL divergence. It follows from ^ and ([8]) that 

Ef„ (X^Y) = X^£;^„Y = X^/x and cov^.^ (X^Y) = X^S(X/3„,o)X = A„, 

where F„ = F„(-,/3„o) and /x = Eg„Y. We can again show that under some regularity 
conditions, the asymptotic distribution of X (Y — fi) is now close to A^(0, A„). Thus the 
KL divergence of iV(0, A„) from A^(0, B„) gives a natural measure of model misspecification. 
This gives another justification for the penalization term on model misspecification in SIC 
with index 7 = 1/2. When the model is correctly specified, we would expect A„ ~ B„ since 
An = B„ and, thus, /[A^(0,B„); A^(0, A„)] r^ 0. For simplicity, SIC in all the numerical 
studies implicitly refers to the specific form of SIC with index 7 = 1/2. 

6 Numerical examples 

6.1 Polynomial regression 

We simulated 100 independent datasets from the cubic polynomial model 

y = 1 + 5x - 1.25x^ + 0.55a;^ + e, 

where x ~ -/V(0, 1), e ~ N{0,a'^), and e is independent of x. Each dataset contains n i.i.d. 
samples, where n = 25,50,200 and a = 0.5,1,2, respectively. For each dataset, we fit the 
polynomial regression model of order from 1 up to 6 and used AIC, BIC, GAIC, GBIC, and 
SIC to select the order of polynomial regression. Table [T] summarizes the comparison results 
of the frequency of estimated order of polynomial regression model. As expected, AIC tended 
to select larger model than the true one. Interestingly, GAIC performed no better than AIC. 
BIC performed well when the sample size is large but suffered from smaller sample sizes. 
Almost all times both SIC and GBIC outperformed the other information criteria, while SIC 
performed the best by a small margin. 

6.2 Best subset linear regression 

We are curious to see how the new information criteria fare in the linear regression model 
selection scenarios. We simulated 100 datasets from the linear model 

y = X/3o + e, (36) 
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Table 1: Frequency of estimated order of polynomial regression model over 100 simulations 
for different n and a 







Model size {n 


= 25) 


Model 


size {n = 


50) 


Model size {n 


= 200) 


(J 


Criterion 


2 


3 


4 


5 6 


2 


3 


4 


5 


6 


2 


3 


4 


5 6 


0.5 


AIC 





68 


13 


8 11 





72 


16 


6 


6 





73 


17 


3 7 




GAIC 





34 


20 


19 27 





55 


18 


10 


17 





35 


19 


22 24 




BIC 





85 


5 


5 5 





94 


5 


1 








94 


6 







GBIC 





100 











99 


1 











98 


2 







SIC 





100 











99 


1 











99 


1 





1 


AIC 





72 


15 


8 5 





72 


13 


9 


6 





80 


11 


5 4 




GAIC 





37 


25 


15 23 





55 


14 


17 


14 





47 


19 


13 21 




BIG 





87 
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5 





94 


6 











100 










GBIC 





99 


1 
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100 










SIC 





99 


1 








100 














100 
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AIC 


1 


75 


10 


10 4 





81 


5 
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6 





79 


13 


4 4 




GAIC 





34 


20 


20 26 





56 


11 


10 


23 





46 


21 


14 19 




BIG 


2 


90 


3 


3 2 





90 


5 


4 


1 





99 





1 




GBIC 


15 


85 








1 


96 


3 











100 










SIC 


8 


92 











97 


3 











100 









where X is n x p design matrix and e ^ N{0, cr'^In) independent of X. For each sim- 
ulated dataset, the rows of X were sampled as i.i.d. copies from A^(0, Sq) with Sq = 
(0.5l*-^l)ij=i,...,p. We fixed p = 6 and /3o = (1,-1.25,0.75,0,0,0)'^ and considered n = 50 
and 200 and a = 0.5 and 1, respectively. For each dataset, we applied the best subset 
regression and used AIC, BIC, GAIC, GBIC, and SIC to select the model size. Table [2] 
summarizes the comparison results of the frequency of estimated model size. Again AIC 
tended to select larger model than the true one, and it is surprising that GAIC performed 
no better than AIC. The performance of BIC deteriorated when the sample size is not large. 
Both SIC and GBIC outperformed the other information criteria, while SIC performed the 
best. The inclusion of second order term in both SIC and GBIC partially explains their good 
performance in small samples. 

6.3 Linear regression with interaction 

Here we test how each information criterion behaves when the true model is not among 
the family of candidate models for selection. We simulated 100 datasets from the following 
model, which adds an interaction term to the linear model ()36p in Section 16. 2t 

y = X/3o + 0.5xp+i + e, 
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Table 2: Frequency of estimated model size by best subset regression over 100 simulations 
for different n and a 
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11 
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90 


10 











95 


5 







SIC 


1 


95 


4 











97 


3 






where X = (xi 



T 



5 ^r, 



, Xp) is n X p design matrix with Xj = {xij, 

Xp+i = (xi,p+i, • • • , x„,p+i)^ with Xj,p+i = Xii ■ Xi2 



and e ~ N{0,a'^ln) independent of X. We kept the rest of the setting in Section [612] except 
that Eq = Ip and n = 50, 200, 1000, and pretended that the data were generated from model 
(f36|) with predictors Xi,- ■ ■ ,Xq. So the family of our working models does not include the 
true model due to the presence of the interaction term, 0.5 times the product of the first 
two predictors. For each dataset, we applied the best subset regression and used AIC, BIC, 
GAIC, GBIC, and SIC to select the model size. Table [3] summarizes the comparison results 
of the frequency of estimated model size. The conclusions are similar to those in Section 
6.21 It is interesting to note that since the interaction term X1X2 is uncorrelated with all 
the candidate variables Xi , • • • , Xq , the "correct" number of predictors for this misspecified 
model should be 3. 



6.4 Nonlinear regression 

Here we simulate a single-index model, i.e., Y is dependent on a linear combination of x 
through a nonlinear link function. Let 

^2 



/(^) 



a + z 



and f{z) = {f{zi),--- ,f{zn)) , z = (zi, • • • , z„) . 



We simulated 100 datasets from nonlinear regression model 

y = f(X/3o) + e, 



(37) 
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Table 3: Frequency of estimated model size by best subset regression over 100 simulations 
for different n and a 
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where X is an n x p design matrix and e ~ -/V(0, /„) independent of X. We set a = 0.25, 0.5 
in ([37|) and kept the rest of the setting in Section [612] except that Sq = Ip and n = 50, 150, 
and 1000. We pretended that the data were generated from model ()36p . So the family of 
our working models certainly does not include the true model due to the nonlinearity. For 
each dataset, we applied the best subset regression and used AIC, BIC, GAIC, GBIC, and 
SIC to select the model size. Table S] summarizes the comparison results of the frequency 
of estimated model size. The conclusions are similar to those in Section 16. 3[ We see as well 
that GAIC had no clear advantage over AIC in some of the settings. 



6.5 Polynomial regression with heteroscedasticity 

Since our new criteria add terms related to the variance of the response variable Y, we 
considered a cubic polynomial model with heteroscedastic variance 



y = 1 + 5x 



1.25x2 + 1.55x^ + |x|^/^£ 



where it is assumed that e ~ N{0, o"^) and is independent of x. We kept the rest of the setting 
the same as in Section [6. II and set n = 50, 200, 5000 and a = 0.5, 1. Our model assumes that 
the data were generated from a polynomial model with constant variance. So the family of 
our working models does not include the true model due to the heteroscedasticity. For each 
dataset, we fit the polynomial regression model of order from 1 up to 6 and used AIC, BIC, 
GAIC, GBIC, and SIC to select the order of polynomial regression. Table [S] summarizes 
the comparison results of the frequency of estimated order of polynomial regression model. 
We see that GAIC performed no better than AIC, and both of them tended to select large 
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Table 4: Frequency of estimated model size by best subset regression over 100 simulations 
for different n and a 
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model. SIC and GBIC performed significantly better than all other information criteria, 
indicating the usefulness of capturing the difference between the estimated error variance 
and the apparent residual variance. 

7 Discussions 

We have considered the problem of model selection in misspecified models and introduced 
a family of semi-Bayesian principles with indices 7 G [0, 1] connecting the two well-known 
principles: the KL divergence principle and the Bayesian principle for model selection. The 
asymptotic expansions of the semi-Bayesian principles in misspecified GLMs give a family 
of semi-Bayesian information criteria (SIC) with indices 7 G [0,1]. Considerations of the 
two well-known principles in misspecified GLMs lead to generalizations of AIC and BIG, the 
GAIC and GBIC. In particular, a specific form of SIC with index 7 = 1/2 has a natural 
decomposition into the negative maximum quasi-log-likelihood and two penalties on model 
dimensionality and model misspecification, respectively. Numerical studies have demon- 
strated the advantage of the proposed SIC methodology in finite sample performance for 
model selection in both correctly specified and misspecified models. 

When the model is misspecified, the posterior distribution of the parameter would not 
reflect the true uncertainty of the estimation. In particular, the QMLE /3„ can have a 
different asymptotic covariance matrix than the posterior covariance matrix of /3. When this 
difference grows larger as sample size n increases, its impact on model selection can be rather 
significant. 
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Table 5: Frequency of estimated order of polynomial regression model over 100 simulations 
for different n and a 
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Our technical analysis has revealed that the contrast (i.e., A^ B„) between the covariance 
structure in the misspecified model and that in the true model plays a pivotal role for model 
selection in misspecified models. So it is important to construct accurate estimates of both 
covariance matrices in practice. Some suggestions have been discussed in Section 12.31 So 
far we have explored the expression of SIC with index 7 = 1/2 taking an additive form 
for the three terms: goodness of fit, model complexity, and model misspecification. Possible 
extensions of this specific form of SIC include introducing other weights and considering non- 
additive forms. These problems are beyond the scope of this paper and will be interesting 
topics for future research. 

A Proofs 

A.l Proof of Proposition 1 

It suffices to show that the Hessian matrix of —in{y,P) is always positive definite. In view 
of (JD, we have -d'^£n{y, l3)/dl3^ = X'^5:(X/3)X. Fix an arbitrary /3 G R"*. By assumption 
the minimum diagonal element a of S(X/3) is positive. Thus it follows easily that 

X^S(X/3)X > X^(a7„)X = aX^X > 0, 

since X has full column rank d. This completes the proof. 
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A. 2 Proof of Theorem 1 

By ([7]) and Condition 2, it is easy to show that the Hessian matrix A„(/3) of I {gn', fni' , P)) 
is X 5](X/3)X. So the proof of Proposition 1 apphes to prove the uniqueness of the global 
minimizer of the KL divergence I{gn', fn{--,P))- It follows from ([7]) and Condition 2 that the 
minimizer solves the gradient equation 

dl{gn; /„(•, /3))/5/3 = -^^[EY - /^(X/3)] = 0, 

which concludes the proof. 

A. 3 Proof of Theorem 2 

It is easy to see that Nn{S) is a convex set by its definition. Denote by 

dNn{6) = {/3 : II {n-'B^f' {(3 - /3„,o)ll = n-^^d] 
the boundary of the closed set Nn{S). We define an event 

Qn = I 4(y, Pn,o) > ^ max 4(y, (3) \ . 
[ PedN„{S) J 

We observe that if the event Qn occurs, the continuous function £„(y, •) has a local maximum 
in the interior of Nn{5). Since Condition 1 implies that iniy, •) is strictly convex, this 
maximum must be located at /3^. This shows that 

Qn C iX e Nn{d)}. 

Next we construct a lower bound on P{Qn)- By Taylor's theorem, we have for any /3, 

4(y,/3) - ^n(y,/3„,o) = (/3 - /3„,o)^*n(/3„,o) -\{P- Pn,of ^n{(3J{(3 - /3„,o), 

where /3^ lies on the line segment joining /3 and /3„o- ^^ make a transformation of the 
variable by letting 

u = 5-'Bi/\p-Pn,o)- 

Then it follows that 

4(y,/3) - 4(y,/3„,o) = du^B-'/^^n{l3n,o) - S^u^Vn{P,)u/2. 

Note that (3 G 5iV„,((5) if and only if ||u|| = 1, and that (3 G dNn{S) implies /3^ G Nni6) by 
the convexity of Nn{S). Clearly 



max u^B-i/2^„(/3„,o) = ||B-V2^„(/3„_o 



llu||=i 
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and by Condition 4, for n sufficiently large we have 

min u^V„(/3Ju > min Amin {V„(/3)} > c. 

I|u||=i (3eN„{5) 

Thus we have 

max £„(y,/3)-^„(y,/3„,o)<'J(l|B-i/2*„(/3„,o)II-cV2), 
P&dN„{S) 

which along with Markov's inequality entails that 

i?||B-^/'*„(/3„n)f 



P (Q„) > P ||B-i/2*„ (/3„^o) f < c2<5V4 > 1 



C252/4 
Observe that 

= tr {E [*„,(/3„,o)*n(/3„,o)'^] B-1} = tr(/,) = d. 
For any given rj S (0, 1), letting 6 = ^-172" thus makes 

4(i 

This together with Qn C {/9„ S Nn{5)} and Condition 3 proves /3„ — /3„ = op(l). 

A. 4 Proof of Theorem 3 

We condition on the event Qn defined in the proof of Theorem 2, which has been shown 
to entail that /3„ G Nn{5). By the mean value theorem applied componentwise and ([8]), we 
obtain 



= *„(/3„) = *„(/3„,o) - A„(/3i, • • • , /3J(/3„ - /3„,o) 

,,0)) 



X^(y - Ey) - AniPi, ■ ■ ■ ,/3rf)(/3„ - /3„, 



^ —1/2 

where each of /3i, • • • ,/3^ lies on the line segment joining /3„ and /3„ q. Let C„ = B„ A„. 
Then we have 

Cn(3n - /3n,o) = "n + W„, 

where u„ = — B^ X"^(y — Ey) and 



w^ 



V„(/3i,--- ,/3,)-vJ rBy2(3^^_/3 ) 



By Slutsky's lemma and the proof of Theorem 2, we see that to show 

Cn(3„-/3n,o)^^(0,/d), 
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it suffices to prove u„ — )■ N{0,ld) and w„ = op(l). 

Wn = op(l) follows from Qn C {/3„ G A''n(5)} and Condition 5. Finally we show that 
Un — > N{0,lci). Fix an arbitrary unit vector a € R . We consider the asymptotic distribu- 
tion of the linear combination 



a^u„ = -a^B-V2x^(y -Ey) = Y: 



n 



where Zi = — a B„ Xj(yj — Eyi), i = 1, • • • ,n. Clearly Zj's are independent and have mean 
0, and 



^var(zO = a'^B-V2B„B-i/2a = 1. 
By Condition 6, E\yi — Eyi\^ < M for some constant M. Thus we derive 

n n n 

Y.E\z,f = Y: |a^B-V2x,,|3i^|y, - Ey,f < M^^ |a^B-V2x^|3 

n n 

< mY: llaf ||B-V2x,||3 = MY^i^jB-'^^f/' -. 0, 

where we used the Cauchy-Schwarz inequality and Condition 6. Applying Lyapunov's theo- 
rem yields 

n 

a^u„ = ^Zi A7V(0,1). 

i=l 

Since this asymptotic normality holds for any unit vector a G R , we conclude that u„ — >■ 
N{0,lii), which completes the proof. 

A. 5 Proof of Theorem 4 

It suffices to prove 

Er]n{X) = r]n{Pn,o) " ^tr (A-iB„) + o(l) (38) 

and 

Vn{Pn,o) = E [4(y, 3„)] - ^tr (A-^B,) + o(l). (39) 

We will prove ([38]) and (|39|) in separate steps. Define 



4(y, /3) = 4(y, K + n^'^C-^P) and ^„(/3) = r/„(/3„,o + ^'^'C^'/S), 

where Cn = 'Bn A„ and /3 = (/3i, • • • , fidf ■ 

1) We first prove (|38|) . The key idea is to do a second order Taylor expansion of ?7n(/3) 
around and retain the Lagrange remainder term. By Theorem 1, r/„ attains its maximum 
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n 



^„(0)-^v^(C-iA„C-i)v 



at PnO- Thus we derive 

-IYI [d^Vn{f3.)/dl3jdPkd(3i] v^vm 
= Vn{l3n,o) - ^v^ (By2A;;iBy2) V - 1 ^ [d^Uf3,)/dl3j 8(5^8 Pi] v^VkVi, 

j,k,l 

where v = {vi, ■ ■ ■ ,Vii)'^ = n^^''^Cn{(3n ~ f^no) ^^^^ f3^ hes on the hne segment joining v 
and 0. By Condition 7, 



s> 



Since C„(/3„ -/3„o) ^ N{0,ld) by Theorem 3 and ^||C„(/3„ - /3„ o)lr+ = 0(1) for some 
5 > hy Condition 7, Thoerem 6.2 in DasGupta (2008) apphes to show that for any j,k,l, 

E{nvjVk) ^ 5jk and E{n^^'^\vjVkVi\) = 0{1), 

where 6jk denotes the Kronecker delta. These along with tr (A^^B„) = 0(1) by Condition 
7 yield 



EVniPn) = ^n(/3„ n) 



itr (bV^ A„-iBy2) + 0(1)1 + i Y: o{n'")0{n-^l') 



j,k,i 



1 



^n(/3„,o)-2t^(A-'B„)+o(l) 



2) We then prove (p9]) . The key idea is to represent £n{y,Pno) using a second order 
Taylor expansion of iniy, ■) around /3 and retaining the Lagrange remainder term. It is 
easy to see that the difference between 77n(/3) and ^„(y,/3) is linear in /3, which implies that 
each pair of their second or higher order partial derivatives agree. Since Iniy, ■) attains its 
maximum at /3„, we derive 

Uy,f^n,o) = 4(y, v) = 4(0) - ^v^ [d%{y,o) /df3^] v 
- ^^ [53A^(y,/3J/9/3ja/3fcaA] vjvkvi 

= Uy,Pn) - ^v^ [a2^„(-v)/5/32] V- i j; [d^7jnif3.*)/d/3jdl3kd(3i] v, 



VkVl, 



j,k,l 



where v = (ui, • • • , va)'^ = n -^/^C„(/3„ g ~ Pn)i /3* lies on the line segment joining v and 0, 
and /3^^ = (3^-v. 

It follows from sup q max j^k,i \d^'nn{f^)/df3jd(3kd/3i\ = o{v?''^) in Condition 7 that 

|a^^„(/3,J/5/3,a/3fc9AI=o(n3/2) 
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and d'^r]n{f3)/d(3 is Lipschtz with Lipschitz constant o{n^''^). Thus 

V^ [d^?jn{-v)/dP^] V = V^ [d''?]niO)/d(3^] V + o(n3/2) ||vf 

= nv^(C-iA„C-i)v + o(n3/2)||vf 
= nv^(By2A-X/^)v + o(n3/2)||vi|^ 

Since C„(3„ -/3„,o) ^ iV(0,/d) by Theorem 3 and E\\Cn0n - Pn,o)f^^ = 0{1) for some 
(5 > by Condition 7, Thoerem 6.2 in DasGupta (2008) apphes to show that for any j and 
k, 

E{nvjVk) -^ 6jk and E{n^^^\\vf) = 0(1) 

These along with tr (A~^B„) = 0(1) by Condition 7 yield 



r/„(/3„,o) = E£n{y,Pn,o) = E£n(.y,f3, 



ltr(Bl/'A~WA+o(l 



2 



'n -'■^n -"-^n 



+ o(n3/2)0(n-3/2) 
1 
2 



i?4(y,3j-^tr(A-iB„)+o(l). 



This proves the conclusion. 



A. 6 Proof of Proposition 2 



Part a) follows from ()22p and the definition of the log-marginal likelihood S{y,Wm',En) in 
(fT9|) . Part b) is an easy consequence of part a) . 



A. 7 Proof of Theorem 5 

It is easy to see that £*(y, 5) defined in (f27|l is a smooth concave function on R'^ with its 
maximum attained at 6 = 0, 



d'C{y,S)/dS^ = -nB-i/2A,(3„ + nV2B-V25)B; 



1/2 



and Un{S) = exp [n ^£*(y,(5)] takes values between and 1. Thus by Taylor's theorem, 
expanding ^^(y, •) around gives for any d, 

t^{y,d) = -^d^[dH:(y,S.)/dS']S, (40) 

where 5* lies on the line segment joining 6 and 0. Let D„ = B„ A„(/3„)Bn and for 
rG (0,oo), 

Pn{r) = max max{|Amin [F„(5)] |, |Amax [Fn{S)] 1} , 

\\0\\<r 



'1/2 



'1/2, 



where F„(5) = B„^'^ A„(/3„ + Bn'''S)- A„(/3„) 
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p,-l/2 
-tin 



Fix a sequence of r„ G (0,oo) satisfying pn{rn) < Amin(D„) and define a sequence of 
closed balls 

Br^ = {(5 € R'^ : \\S\\ < rn}. 

It follows from (gDD that 



(41) 



where 



Qii^) = 2^^ P" ~ Pn{rn)Id] S and g2(^) = -^S"^ Dn + Pn{rn)Id 



d. 



This entails 

i?;.,^ (e-'^^lB.J < E,,, [Ur^iSrisJ < E,^ (e-"'^UB„J . (42) 

We will see later that inequality (|42p is the key step in deriving the asymptotic expansion of 
logii;^,jC/„(5)"]in(l28D. 

We list some auxiliary results in the following two lemmas, whose proofs will be given 
separately. 



Lemma 1. Under Condition 8 and provided that rn < rQ, we have 

ci I e-""^ 1b.„ dfio < E^^^ (e-'^*- 1b.„ ) < ^2 J e""* 1b,„ d^o, J = 1, 2. 
Lemma 2. It holds that 



E, 



MiS! 



Un{S)"'lBc^ < exp {-UKnrl) , 



-niji 



-n.(j2 



d/iO 



d/iO 



(^) 



d/2 

) |D„ -p„(r„)7d| 



-1/2 



/27rY/' 






\ n J 



|D„+p„(r„)/rf|-i/2, 



27r \ '^/^ 
e "*1bc (i/io< ( exp 



Thr\j^, 



d d 



2 ""22 



+ 77 + TT log {nKnr^d ) 



(43) 

(44) 
(45) 

(46) 

(47) 



where k„ = Amin(Dn) — Pn{fn), j = 1> 2, and ii is assumed that UKnr'^ > d in (4^ - 

Now we are ready to obtain the asymptotic expansion of log E^^j^lUniS)"^] in ([28|) . We 
pick a sequence of positive numbers r„ — )• and work under Conditions 8 and 9. It follows 
from Condition 9 that 

(1 - C3)Anim(D„) < Kn = Ainin(Dn) " Pn{rn) < Amin(D„). 



Thus in view of r„ = o(l) and Amm(D 



-l»,-2 
nj I n 



0{n°'), we have 



A;-i = 0{n''rl) = o{n'^) and n-'-^-''\nKnrl) 



oo, 
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where 1 — a G (0, 1]. Note that Amax(D„) = 0(1). Since an exponential rate of convergence 
to zero is faster than a polynomial rate, by Lemmas 1 and 2 we can claim that for j = 1,2, 

d/2 



which along with ()42p and ()44p yields 



^V ID 1-'/^ 
n / 



+ 0(1), 



logi^^,,jC/„(d)"]=log 



I \^-'n\ 

n J 



+ 0(1) 



This together with ()28p proves the conclusion for fixed y. 

A. 8 Proof of Lemma 1 

Since r„ < tq, by ()26p and Condition 8 we have for any 8 S -Bj.„, 



ci< 



d/uo 



(<5) < C2, 



which immediately gives inequality ([13 



A. 9 Proof of Lemma 2 

In light of the choice of r„ and the definition of Pnifn), it is easy to show that 

UniS)"-lB^^ < exp {-n [Amin(D„) - Pn{rn)] T^] , 

which yields inequality ^^ since ^ajt is a probability distribution. Equations ()15]) and ()i6]l 
can be obtained by using the multivariate Gaussian integral. It remains to prove ()47p . 
By the definition of functions qi and q2, we observe that 

qj{8) > \ [An,in(D„) -/9n(rn)] ¥f. S G 5^„,j = 1,2. 



Let Kn = Amin(Dn) " Pn{rn) and F = HKnld- Then for i = 1,2 

d/2 



!■ 



2tt 



riKr, 



p 



{uK^r'/'Zf > r'„ 



(48) 



where Z '--^ N{0,ld). Assume riKnr'^ > d. Pick e^ G (0,oo) in a way such that (i(l + £„) 
riKnr'^- Then by the deviation bound in Lemma 3(a) of Fan and Lv (2008), we have 



[riKr, 



n)-'/'Zf>r'J=P{d-'\\Zf>l + en)< 



g~^£n" 
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where A^^ = [e„ - log(l + e„)] /2. Thus 



nK„ 



-1/22||2 



> r 



< exp 



-nK„r^ + - + - log {riKnrld ^) 



This along with (|48p concludes the proof. 

A. 10 Proof of Theorem 6 

In view of ([3T|) . we have for 7 € [0, 1/2], 

-D7(3„,m;^m) = (l-27)/(5-„;/„(-,3n,m))+7 lidn, fn{- ,Pn,m)) - Han, fn{- , Pn,m,o)) 



+ 1I 9n; 



dfj.0 

and for 7 G [1/2,1], 

D'y0n,m'^^m) = (1 - 7) Hdn; fn{-,Pn,m)) - Hdn, fn{-, Pn,mfi)) 



+ ii an] 



dUrn 



We drop the subscript m for simplicity. It follows from (J14p and (J16p in Theorem 4 that 



/(ffn; /„(•, 3n)) = E log 5„(y) - E£n{y, 3n) + tr (A-iB„) + o(l). 
By ([H]), we have 

I{9n; /„(■, 3n)) - -^(5n; fn{-,Pnfl)) = Vn{Pn,o) " ^n(3n)- 

This along with (j38p in the proof of Theorem 4 entails 

E l/(5n;/„(-,3„)) -%n;/n(-,/3„,o))] = ^tr (A-^B^) + o(l). 



(49) 



(50) 



It follows from 1^ and (^ in Theorem 5 that 

dl^rn 



EI gn, 



duo 



: -EUy,PJ + ^|97t| - ^i5;iog |A„'b, 
+ Eloggn{y)-ERiy,m;Fn), 



(51) 



where A„ = A„(/3^). Therefore combining ()49P "(|5i p together with the above representations 
of L'^(/3„;9Jt) yields ([33|) . which completes the proof. 

A. 11 Proof of Proposition 3 

The decomposition in ([5^ follows from the definition of SIC^(y,97t;F„) with 7 = 1/2 in 
(|34p and a fact that the KL divergence of a d-variate Gaussian distribution A^(/^2' ^2) from 
a d-variate Gaussian distribution A^(/^i, Si) is given by 

/ [iV(/^i, Si); iV(/X2, S2)] = ^ [tr(S2-iSi) - log |S2^iSiI + {fi^ - fi^f^^^fi^ - /^i) - d] . 
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B Three cominonly used GLMs 

Linear regression model, logistic regression model, and Poisson regression model are three 
commonly used GLMs. In this section, we give the formulas for A„ and B„ given in Section 
12. 3t when F„ is chosen to be one of them. 



B.l Linear regression 

This model assumes 

y = X7 + £, (52) 

where X is an n x d design matrix and £ ^ N{0, a'^In)- Let /3 = 7/0'^ and b{6) = a'^9'^/2, 9 G 
R. Then the quasi-log-likelihood of the sample defined in (JH) becomes 

£„(y, (3) = y^X/3 - l^b(X/3) " ^ " ^ log -^ - ^^ 

lly — X7JP n 2 nlog27r 
- Trlogc^ n • 



2cj2 2 

Maximizing ^„(y,/3) with respect to 7 yields the least squares estimator 7 = (X X)~"^Xy. 

II ~V ^ II 2 

We estimate a"^ by using the residual sum of squares (RSS), a"^ = ^ n.-d • Then we obtain 
an estimate (3^ = 7/3"^. Since b'{9) = a'^9 and b"{9) = u^, we have 

A„ = a2x^X and B„ = X^diag {[y - X7] o [y - X7]} X, (53) 

where o denotes the Hadamard (componentwise) product. An interesting specific case is 
when the true model is indeed linear, but may involve a different set of covariates. Then 
B„ = r^X X, where r^ is the true variance of Y had we known the model precisely. The 
SIC with index 7 = 1/2 just involves an additional term penalizing the inflation of the error 
variance due to model misspecification. 

B.2 Logistic regression 

In this model, b{9) = log(l + e^), € R and the quasi-log-likelihood of the sample is 
£„(y,/9) = y X/3 — 1 b(X/3), where we ignored the last constant term in (JH). By defini- 
tion in{y,-) attains its maximum at /3„. Since b'{9) = -^-^ and b" (9) = t-^f^e\2 1 matrices 
A„ and B„ are given in m and m with /i(x3„) = {j^,--- , j^V , ^^X) = 
diag{(Yq$7j2r-- , ^i+Xy }, and (^i,--- ,9n)^ = x3n- 

B.3 Poisson regression 

In this model, b{9) = e^, 6* G R and the quasi-log-likelihood of the sample is ^„(y,/9) = 
y^X/3 — l-^b(X/3), where we ignored the last constant term in ([1]). By definition iniyr) 
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attains its maximum at /3„. Since b'{9) = e^ and b"{6) = e^ , matrices A„ and B„ are given 
in (mD and ^ with /x(x3„) = (e'^i,--- ,e^")'^ and 5](x3„) = diagje'^i , • • • ,e^"}, and 



^ir ■ ■ )^n) — X/3„. 
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